Non-Conservative Diffusion and its Application to Social Network Analysis 



Rumi Ghosh, Kristina Lerman, Tawan Surachawala
USC Information Sciences Institute
Marina del Rey, CA 90292, USA
{rumig, lerman, tawans}@isi.edu

Konstantin Voevodski
Department of Computer Science
Boston University
kvodski@gmail.com

Shanghua Teng
University of Southern California
Los Angeles, CA 90007, USA
shanghua@usc.edu



Abstract — Is the random walk appropriate for modeling and 
analyzing social processes? We argue that many interesting 
social phenomena, including epidemics and information diffu- 
sion, cannot be modeled as a random walk, but instead must be 
modeled as broadcast-based or non-conservative diffusion. To 
produce meaningful results, social network analysis algorithms 
have to take into account differences between these diffusion 
processes. We formulate conservative (random walk-based) 
and non-conservative (broadcast-based) diffusion mathemati- 
cally and show how these are related to well-known metrics: 
PageRank and Alpha-Centrality respectively. This formulation 
allows us to unify two distinct areas of network analysis 
— centrality and epidemic models — and leads to insights 
into the relationship between diffusion and network structure, 
specifically, the existence of an epidemic threshold in non- 
conservative diffusion. We demonstrate, by ranking nodes in 
an online social network used for broadcasting news, that 
non-conservative Alpha-Centrality leads to a better agreement 
with empirical ranking schemes than conservative PageRank. 
In addition, we give a scalable approximate algorithm for 
computing the Alpha-Centrality in a massive graph. We hope 
that our investigation will inspire further exploration of the 
applications of non-conservative diffusion in social network 
analysis.

I. Introduction

Social network analysis algorithms examine the topology 
of the network to identify central nodes within it or groups 
of tightly connected nodes. In many cases, these algorithms 
make implicit assumptions about the underlying diffusion 
process taking place on the network [1]. Some of the
best-known algorithms used for graph partitioning [2] and
ranking, including PageRank and its variants [3], [4], are
based on the random walk [5], [6]. A random walk on a graph
is a stochastic process which starts at some node, and at 
each time step randomly selects one of the neighbors of the 
current node. The random walk is used to model chemical 
diffusion and other physical processes in which the total 
amount of the diffusing substance remains constant. How- 
ever, the random walk may not be appropriate for modeling 
phenomena of greatest interest to social scientists, including 
adoption of innovation [7], [8], the spread of epidemics [9],
[10], word-of-mouth recommendations [11], viral marketing
campaigns [12], [13], and information diffusion [14].



These examples are modeled as contact processes, where 
an activated or "infected" node activates its neighbors with 
some probability. Rather than picking one of the neighbors, 
in these stochastic processes each node broadcasts to all 
its neighbors. Therefore, unlike the random walk, which 
conserves the amount of substance diffusing on the network, 
contact processes are fundamentally non-conservative. When 
an idea, information, or disease spreads from one individual 
to her neighbors, the amount of information or disease 
changes (Chapter 5, [13]). If the random walk cannot model
these social processes, can we trust results of social network 
analysis algorithms that are based on the random walk? If 
not, what are the appropriate metrics and methods to use 
for network analysis? And how can we empirically evaluate 
their performance? 

In this paper we present a mathematical formulation of 
conservative and non-conservative diffusion and demonstrate 
how these are related to two well-known centrality metrics 
used to rank nodes in a network: PageRank [15] and
Alpha-Centrality [16]. While PageRank is known to be
equivalent to conservative diffusion [5], [6], we show that
Alpha-Centrality is related to non-conservative diffusion, of
which epidemic models are the best-known example. Our
formulation unifies two distinct research areas within network
analysis, centrality measures and epidemic models, and leads
to insights into the relationship between dynamic processes
and network structure. One consequence of the analysis is the
existence of a threshold, called the epidemic threshold [17],
below which non-conservative diffusion dies out, but above
which it reaches a significant fraction of nodes within the
network. We elucidate the connection between the properties
of Alpha-Centrality and the location of the epidemic threshold.

We demonstrate empirically that the choice of the centrality
metric impacts our ability to identify central or influential
nodes within a network. Specifically, we study the online
social network of Digg, which is used to spread news stories.
The spread of news on Digg can be modeled as an epidemic
process [18], and hence represents non-conservative
diffusion. One benefit of using social media data sets is
that user activity on these sites provides an independent
measure of influence. We define two empirical measures
of influence that serve as the ground truth for ranking
users within this social network. We compare the rankings
produced by different centrality metrics to the ground truth
and show that non-conservative Alpha-Centrality leads to
better agreement with the ground truth than conservative
PageRank. Finally, we present an approximate algorithm that
can efficiently compute Alpha-Centrality for massive graphs
and give a proof of its performance guarantees.

Specifically, the paper makes the following contributions: 

• Define and classify diffusion processes occurring on
networks (Section II).

• Establish a connection between diffusion and network
structure (Section III). We also show how centrality
metrics are related to diffusion processes occurring on
the network.

• Empirically validate the hypothesis that a non-conservative
metric better predicts central people in an online social
network used for (non-conservative) information diffusion
than a conservative metric (Section IV).

• Provide a fast approximate algorithm to compute
Alpha-Centrality (Section V).

II. Classes of Diffusion Processes 

We represent a network by a directed, weighted graph
$G = (V, E)$ with $|V|$ nodes and $|E|$ edges. We use
$w[u, v]$ to specify the weight of the edge from $u$ to $v$.
The adjacency matrix of the graph is defined as
$A[u, v] = w[u, v]$ if $(u, v) \in E$; otherwise,
$A[u, v] = 0$. $N(u)$ is the set of out-neighbors of $u$:
$N(u) = \{v \in V \mid (u, v) \in E\}$; $d_{out}(u)$ is the
out-degree of $u$: $d_{out}(u) = \sum_{v \in N(u)} w[u, v]$;
and $d_{max}$ is the maximum out-degree of any node in the
graph. The $L_1$-norm of a vector is denoted by $\|\cdot\|_1$.
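The notation above can be made concrete with a short sketch. The 4-node edge list below is a hypothetical example, not data from the paper, and the variable names are chosen only for illustration.

```python
# Hypothetical directed, weighted graph given as (u, v, w[u, v]) triples.
edges = [(0, 1, 1.0), (0, 2, 2.0), (1, 2, 1.0), (2, 0, 1.0), (3, 2, 0.5)]
n = 4

# Adjacency matrix: A[u][v] = w[u, v] if (u, v) is an edge, otherwise 0.
A = [[0.0] * n for _ in range(n)]
for u, v, w in edges:
    A[u][v] = w

# Out-neighbors N(u) and weighted out-degree d_out(u).
N = {u: [v for v in range(n) if A[u][v] != 0.0] for u in range(n)}
d_out = [sum(A[u]) for u in range(n)]
d_max = max(d_out)


def l1_norm(x):
    """L1 norm of a vector: the sum of absolute values of its entries."""
    return sum(abs(xi) for xi in x)
```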

Network diffusion is a dynamic stochastic process that
distributes some quantity, which we generically refer to
as weight, on a network or a graph. A diffusion process is
described mathematically by a function
$F : (\mathbb{R}^+ \cup \{0\})^{|V|} \to (\mathbb{R}^+ \cup \{0\})^{|V|}$,
i.e., a map from a $|V|$-dimensional non-negative vector to a
$|V|$-dimensional non-negative vector (here $|V|$ is the
number of nodes). The vector
$x \in (\mathbb{R}^+ \cup \{0\})^{|V|}$ represents the weight
each node has at time $t$. The function $F(x)$ maps the weight
vector at time $t$ to the weight vector at time $t + 1$.

A. Conservative Diffusion 

We call a stochastic process
$C_t : (\mathbb{R}^+ \cup \{0\})^{|V|} \to (\mathbb{R}^+ \cup \{0\})^{|V|}$
that simply redistributes the weights among the nodes of the
graph, with the total weight remaining constant, conservative
diffusion. In other words, in conservative diffusion, for all
$x \in (\mathbb{R}^+ \cup \{0\})^{|V|}$,
$\|x\|_1 = \|C_t(x)\|_1$.

To motivate our mathematical formulation of conservative 
diffusion, we imagine a hypothetical society where each 
member has some amount of money to redistribute. If 
money cannot be created or destroyed, money redistribution 
represents a conservative diffusion process. Let $X_c(t)$ be
the vector representing the amount of money each member
has at time $t$, and let $\Delta(t)$ represent the amount they
receive at time $t$. We consider a distribution process where
the amount redistributed at each step depends on the money
each member received in the previous step. We focus on 
this redistribution process, because, as we show later, this is 
the process underlying popular network models. A different 
conservative process could be one in which the amount 
redistributed in each step depends on the amount each 
member had in the previous time step. This would lead to a 
different mathematical formulation of the diffusion process. 

At time $t+1$, each member retains a fraction $(1 - \alpha)$,
with $0 \le \alpha \le 1$, of this amount and distributes the
rest among its neighbors. Let $W_c$ be the transfer matrix,
with $W_c[p, q]$ representing the fraction of the amount to be
redistributed by node $p$ that is transferred to node $q$.
Therefore, the amount of money nodes receive at time $t + 1$
via redistribution can be written as:

$\Delta(t+1) = \alpha \Delta(t) W_c$.

Thus the transfer matrix encodes the rules of diffusion.
If each member divides $\alpha \Delta(t)$ equally amongst her
out-neighbors, then $W_c = D^{-1}A$, where the degree matrix
$D$ is a diagonal matrix of out-degrees, and $A$ is the
adjacency matrix.

Step by step, conservative diffusion looks as follows.
Initially, at time $t = 0$, let the weight each node receives
be $\Delta(0) = X_c(0)$. Let the process begin at time $t = 1$,
when each node keeps $(1 - \alpha)$ of that amount and divides
the rest, $\alpha\Delta(0)$, evenly between its out-neighbors.
The amount that out-neighbors receive from redistribution at
time $t = 1$ is
$\Delta(1) = \alpha\Delta(0)W_c = \alpha X_c(0) W_c$.

At time $t = 2$, each node retains $(1 - \alpha)$ of the amount
$\Delta(1)$ it received at time $t = 1$, and divides the rest
among its out-neighbors. Therefore, the amount received by the
out-neighbors is
$\Delta(2) = \alpha\Delta(1)W_c = \alpha^2 X_c(0) W_c^2$.

Continuing with this process, at any time $t > 0$, each node
retains $(1 - \alpha)$ of the amount it received at time
$t - 1$,

$(1 - \alpha)\Delta(t-1) = (1 - \alpha)\alpha\Delta(t-2)W_c$,    (1)

and divides the rest among her out-neighbors. Hence, the
amount received by the out-neighbors is

$\Delta(t) = \alpha\Delta(t-1)W_c = \alpha^t X_c(0) W_c^t$.    (2)



The total weight (or amount of money in our example)
the nodes have at time $t$, $X_c(t)$, is the amount they
retained from all previous time steps plus the amount they
receive from in-neighbors at time $t$:

$X_c(t) = (1 - \alpha)\sum_{k=0}^{t-1} \Delta(k) + \Delta(t)$
$\quad = \sum_{k=0}^{t-1} (1 - \alpha)\alpha^k X_c(0) W_c^k + \alpha^t X_c(0) W_c^t$
$\quad = (1 - \alpha) X_c(0) + \alpha X_c(t-1) W_c$.    (3)

As $t \to \infty$, this equation reduces to

$X_c(t \to \infty) = (1 - \alpha) X_c(0) + \alpha X_c(t \to \infty) W_c$
$\quad = (1 - \alpha) X_c(0) (I - \alpha W_c)^{-1}$.    (4)
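As a minimal sketch of Eqs. (3) and (4), the recursion can be iterated directly and checked for the two properties derived above: the total weight is conserved, and the iterate approaches a fixed point of Eq. (4). The 3-node graph, the starting weights, and the function names are illustrative assumptions, not part of the paper.

```python
def vec_mat(x, W):
    """Row vector times matrix: (xW)[j] = sum_i x[i] * W[i][j]."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def conservative_diffusion(x0, W, alpha, steps):
    """Iterate X(t) = (1 - alpha) X(0) + alpha X(t-1) W, with X(0) = x0."""
    x = list(x0)
    for _ in range(steps):
        xw = vec_mat(x, W)
        x = [(1 - alpha) * a + alpha * b for a, b in zip(x0, xw)]
    return x


# Hypothetical graph in which every node has at least one out-link.
A = [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]
W_c = [[a / sum(row) for a in row] for row in A]  # W_c = D^{-1} A

alpha = 0.85
x0 = [3.0, 2.0, 1.0]
x_inf = conservative_diffusion(x0, W_c, alpha, steps=200)

# Residual of the steady-state equation X = (1 - alpha) X(0) + alpha X W_c.
xw = vec_mat(x_inf, W_c)
residual = max(abs(x - ((1 - alpha) * a + alpha * b))
               for x, a, b in zip(x_inf, x0, xw))
```

Because the rows of $W_c$ sum to one, `sum(x_inf)` stays equal to `sum(x0)` up to floating-point rounding.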

The transfer matrix $W_c$ is a stochastic matrix, since its
rows sum to 1. If, as described above, the weight to be
redistributed at each step is divided equally between the
out-neighbors, then $W_c = D^{-1}A$. However, if instead each
node decides to keep a portion $\delta$ of this amount, this
leads to a more general form of the transfer matrix:

$W_c = \delta I + (1 - \delta) D^{-1} A$.    (5)
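The general transfer matrix of Eq. (5) remains row-stochastic, which a short sketch can confirm. The helper name `lazy_transfer_matrix` and the example graph are assumptions made for illustration.

```python
def lazy_transfer_matrix(A, delta):
    """Return W_c = delta * I + (1 - delta) * D^{-1} A as a list of rows."""
    n = len(A)
    W = []
    for i in range(n):
        d_out = sum(A[i])  # weighted out-degree, assumed to be positive
        row = [(1 - delta) * A[i][j] / d_out for j in range(n)]
        row[i] += delta    # portion each node keeps for itself
        W.append(row)
    return W


A = [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]
W_c = lazy_transfer_matrix(A, delta=0.2)
row_sums = [sum(row) for row in W_c]
```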

Note that in our hypothetical society, the total amount of
money remains constant: if $C_t : X_c(0) \to X_c(t)$ defines a
diffusion process, then $\|X_c(0)\|_1 = \|C_t(X_c(0))\|_1$.
Hence this is a conservative diffusion process. In the above
scenario, $C_t$ is a linear mapping; therefore, we call the
diffusion processes given by Eqs. (3) and (4) linear
conservative diffusion. In a more general representation,
$C_t$ can even be a non-linear mapping, describing non-linear
conservative diffusion.

Random Walk as Conservative Diffusion: Like money
transfer, a random walk on a graph can be modeled as a
conservative diffusion process, since the total probability of
finding the random walker somewhere on the graph is always
one. A random walk with random jumps or restarts can be
described mathematically as follows. Let the initial
probability to find the random walker on any node be uniform,
i.e., $X_c(0)[i] = \frac{1}{|V|}$. At any time $t$, with
probability $\alpha$ the random walker at node $i$ chooses one
of the neighbors of $i$ uniformly at random and jumps to it.
With probability $(1 - \alpha)$, it chooses any node on the
graph uniformly at random and jumps to it. Let matrix $X$
encode the probability of jumping to any node,
$X[i, j] = \frac{1}{|V|}$, and let $W_c = D^{-1}A$. Then the
probability of finding the random walker at node $j$ at time
$t$ is given by

$X_c(t) = (1 - \alpha) X_c(t-1) X + \alpha X_c(t-1) W_c$
$\quad = (1 - \alpha) X_c(0) + \alpha X_c(t-1) W_c$.

The second equality holds because the entries of $X_c(t-1)$
sum to one, so $X_c(t-1) X$ is the uniform vector $X_c(0)$.
This is exactly the same as Eq. (3). Therefore, a random walk
with a uniform starting vector is mathematically equivalent
to a linear conservative diffusion process.
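The equivalence can be checked numerically. The sketch below, under the assumption of a small hypothetical 3-node transfer matrix, runs the random walk with an explicit uniform jump term next to the diffusion recursion of Eq. (3) and compares the two trajectories; the function names are illustrative only.

```python
def vec_mat(x, W):
    """Row vector times matrix: (xW)[j] = sum_i x[i] * W[i][j]."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def restart_step(x, W, alpha):
    """Random-walk step with a jump matrix whose entries are all 1/|V|."""
    n = len(x)
    jump = [sum(x) / n] * n  # x times the uniform jump matrix
    walk = vec_mat(x, W)
    return [(1 - alpha) * j + alpha * w for j, w in zip(jump, walk)]


def diffusion_step(x, x0, W, alpha):
    """One step of Eq. (3): (1 - alpha) X(0) + alpha X(t-1) W."""
    walk = vec_mat(x, W)
    return [(1 - alpha) * a + alpha * w for a, w in zip(x0, walk)]


# Hypothetical row-stochastic transfer matrix W_c = D^{-1} A.
W = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
n = 3
x0 = [1.0 / n] * n  # uniform starting vector

xa, xb = list(x0), list(x0)
for _ in range(20):
    xa = restart_step(xa, W, 0.85)
    xb = diffusion_step(xb, x0, W, 0.85)

max_gap = max(abs(a - b) for a, b in zip(xa, xb))
```

The two trajectories agree only because the starting vector is uniform and sums to one; with a non-uniform start the jump term would no longer equal $(1 - \alpha) X_c(0)$.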



B. Non-Conservative Diffusion 

A diffusion process where the total weight can change
in time is a non-conservative diffusion process. Formally, a
function
$N_t : (\mathbb{R}^+ \cup \{0\})^{|V|} \to (\mathbb{R}^+ \cup \{0\})^{|V|}$
defines a non-conservative diffusion process if for some
$x \in (\mathbb{R}^+ \cup \{0\})^{|V|}$,
$\|x\|_1 \ne \|N_t(x)\|_1$.

To illustrate the difference between conservative and non- 
conservative processes, we return to our hypothetical society. 
Again, imagine that each member has some amount of
money. Unlike in the previous example, however, each member
also has a money printing machine, so that instead of
dividing the money she receives equally between her
out-neighbors, she can give each neighbor the same amount by
printing extra money as needed.

Let $\Delta(t)$ be the vector representing the amount of money
each member receives at time $t$. At the next time step, each
member prints a fraction $\alpha$ of this amount to give to
each of her out-neighbors. The additional amount that she
produces for her out-neighbors can be expressed using the
replication matrix $W_n = A$. Therefore,
$\Delta(t+1) = \alpha\Delta(t)W_n$.

Initially, let $\Delta(0) = X_n(0)$. At time $t = 1$, each
member prints $\alpha\Delta(0)$ for each of her out-neighbors:

$\Delta(1) = \alpha\Delta(0)W_n = \alpha X_n(0) W_n$.

Similarly, at time $t = 2$,

$\Delta(2) = \alpha\Delta(1)W_n = \alpha^2 X_n(0) W_n^2$.

Continuing this process, the additional amount of money each
member produces or receives at time $t$ is:

$\Delta(t) = \alpha\Delta(t-1)W_n = \alpha^t X_n(0) W_n^t$.    (6)

Therefore, the total amount that each member has at time $t$
is obtained by summing up the additional amount she accrues
or receives from her in-neighbors at each time step:

$X_n(t) = \sum_{k=0}^{t} \Delta(k) = \sum_{k=0}^{t} X_n(0) (\alpha W_n)^k$
$\quad = X_n(0) + \alpha X_n(t-1) W_n$.    (7)

As $t \to \infty$, Eq. (7) reduces to

$X_n(t \to \infty) = X_n(0) \sum_{k=0}^{\infty} (\alpha W_n)^k$,    (8)

which can be solved to yield

$X_n(t \to \infty) = X_n(0) + X_n(t \to \infty)(\alpha W_n)$
$\quad = X_n(0)(I - \alpha W_n)^{-1}$.    (9)

This expression is defined for $\alpha < 1/\lambda_1$, where
$\lambda_1$ is the largest eigenvalue, or spectral radius, of
$W_n$.
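A minimal sketch of Eq. (7) illustrates both regimes. The 3-node graph is a hypothetical example whose adjacency matrix has characteristic polynomial $\lambda^3 - \lambda - 1$, so $\lambda_1 \approx 1.32$ and $1/\lambda_1 \approx 0.75$; the function names are assumptions for illustration.

```python
def vec_mat(x, W):
    """Row vector times matrix: (xW)[j] = sum_i x[i] * W[i][j]."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def non_conservative_diffusion(x0, W, alpha, steps):
    """Iterate X(t) = X(0) + alpha X(t-1) W  (Eq. 7), with X(0) = x0."""
    x = list(x0)
    for _ in range(steps):
        xw = vec_mat(x, W)
        x = [a + alpha * b for a, b in zip(x0, xw)]
    return x


# Hypothetical replication matrix W_n = A.
A = [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]
x0 = [1.0, 1.0, 1.0]

# alpha below 1/lambda_1: the iteration settles at the solution of Eq. (9).
x_small = non_conservative_diffusion(x0, A, alpha=0.5, steps=200)
xw = vec_mat(x_small, A)
residual = max(abs(x - (a + 0.5 * b)) for x, a, b in zip(x_small, x0, xw))

# alpha above 1/lambda_1: the total weight keeps growing with time.
total_10 = sum(non_conservative_diffusion(x0, A, alpha=0.9, steps=10))
total_20 = sum(non_conservative_diffusion(x0, A, alpha=0.9, steps=20))
```

In contrast to the conservative case, `sum(x_small)` exceeds `sum(x0)`: weight is created at every step.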

More generally, if along with producing a fraction $\alpha$ of
what it receives from each of its in-neighbors, a node also
produces a portion $\delta$ of this amount for itself, this
results in a more general form of the replication matrix:

$W_n = \frac{\delta}{\alpha} I + A$.    (10)

The diffusion process defined by Eqs. (7)-(9) is
non-conservative, since
$\|X_n(0)\|_1 \ne \|N_t(X_n(0))\|_1$. Moreover, it is linear,
although the function $N_t$ may also be non-linear.

We can model non-conservative diffusion as a random 
walk with birth, where at each time step, the random walker 
gives birth to one or more new walkers. The number of 
random walkers on the network, therefore, will change 
with time. Several social phenomena can be modeled using 
this framework. In rumor propagation, for example, some 
information spreads in a community as people pass it to 
their neighbors. This process is non-conservative, since the 
number of informed individuals grows in time. We can 
model rumor propagation as a random walk on the friendship 
graph, where the random walker (rumor) randomly selects 
one of the neighbors of the informed node to move to, while 
leaving a clone of itself at the node. Cloning is required 
for the node to remain informed. If the informed node 
immediately forgot the rumor (no cloning required), then
rumor propagation could be modeled by a simple random 
walk and would be conservative in nature, since the number 
of informed individuals would always be one. 

Epidemics as Non-Conservative Diffusion: Non- 
conservative diffusion provides a useful framework for 
thinking about epidemics and other spreading processes and 
leads to insights into the relation between network structure 
and dynamics of spreading processes. In a spreading 
process, information or virus spreads from an informed or 
infected individual to her network neighbors. In order to 
model a spreading process accurately, the structure of the 
underlying network has to be taken into account. Wang et
al. [17] modified existing SIS models [19] to take network
structure into account in order to describe the spread
of epidemics in real networks. We demonstrate that this
model is equivalent to the linear non-conservative diffusion
process (Eq. (7)).

Consider a virus spreading on a network, where at each
time step, a node infected with the virus may infect its
out-neighbors with probability $\mu$ (virus birth rate). At
each time step, an infected node may also be cured with
probability $\beta$ (virus curing rate). Wang et al. [17]
showed that the probability $p_{i,t}$ that node $i$ is
infected at time $t$ can be written in matrix notation as

$P_t = P_{t-1}((1 - \beta) I + \mu A) = P_0((1 - \beta) I + \mu A)^t$,

where $P_t$ is the vector $(p_{1,t}, p_{2,t}, \ldots)$ and
$P_0$ is the initial probability of infection.^1 This
formulation makes the probability of infection at time $t$,
$P_t$, exactly equal to the additional weight, $\Delta(t)$,
accrued at each step in non-conservative diffusion, as shown
in Eq. (6), with the replication matrix
$W_n = \frac{1 - \beta}{\mu} I + A$ and $\alpha = \mu$. In the
model described above, there exists an epidemic threshold
$\tau$ such that for $\mu/\beta < \tau$ the epidemic will die
out, and for $\mu/\beta > \tau$ it will spread to a
significant fraction of nodes [17]. For any graph, this
threshold is given by the inverse of the largest eigenvalue of
the graph's adjacency matrix $A$: $\tau = 1/|\lambda_1|$.

^1 This model holds true only when $p_{i,t}$ is very small, and
there may be situations where $p_{i,t} > 1$. Therefore a more
accurate interpretation is that the probability of infection is
proportional to $p_{i,t}$.
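The threshold can be illustrated with a rough sketch that estimates $\lambda_1$ by power iteration and then applies the linear update $P_t = P_{t-1}((1 - \beta) I + \mu A)$ on either side of $\tau$. The graph, the rates, and the function names are hypothetical choices for illustration; power iteration is used here as a convenient stand-in and assumes a strongly connected, aperiodic graph. Consistent with the footnote, the entries of $P_t$ are read as quantities proportional to infection probability, not as probabilities.

```python
def vec_mat(x, W):
    """Row vector times matrix: (xW)[j] = sum_i x[i] * W[i][j]."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def largest_eigenvalue(A, iters=500):
    """Estimate lambda_1 of a non-negative matrix by power iteration."""
    n = len(A)
    x = [1.0 / n] * n
    lam = 0.0
    for _ in range(iters):
        y = vec_mat(x, A)
        lam = sum(y)              # sum(x) == 1, so this estimates lambda_1
        x = [v / lam for v in y]  # renormalize to unit L1 norm
    return lam


def infection_mass(A, mu, beta, steps):
    """Total of P_t = P_0 ((1 - beta) I + mu A)^t for a small uniform P_0."""
    p = [0.01] * len(A)
    for _ in range(steps):
        pa = vec_mat(p, A)
        p = [(1 - beta) * pi + mu * pai for pi, pai in zip(p, pa)]
    return sum(p)


A = [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]
lam1 = largest_eigenvalue(A)
tau = 1.0 / lam1  # epidemic threshold

below = infection_mass(A, mu=0.30, beta=0.5, steps=200)  # mu/beta = 0.6 < tau
above = infection_mass(A, mu=0.45, beta=0.5, steps=200)  # mu/beta = 0.9 > tau
```

With these values the total below the threshold decays toward zero, while above the threshold it grows without bound in the linear model.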

III. Diffusion and Network Structure 

The complex interplay between network structure and 
diffusion has broad implications for modeling and under- 
standing networks. While it is known that the macroscopic 
properties of diffusion (e.g., epidemic threshold) are affected 
by network structure [17], [20], the impact of diffusion on
our understanding of network structure is less appreciated. In
this paper we show that social network analysis, specifically,
identifying central or influential nodes, is affected by the
characteristics of the diffusion process occurring on the
network. Centrality metrics used for this task examine the
topology of the network only. However, these metrics usually
make implicit assumptions about the nature of the diffusion
process taking place on the network [1], with each metric
leading to a different, even conflicting, notion of who
the central nodes are. We show that the characteristics of
network diffusion should be one of the guiding principles in 
choosing an appropriate network analysis algorithm. 

A. Centrality and Diffusion 

A node's centrality predicts its relative importance, influ- 
ence, or prestige within the network. Over the years many 
different centrality metrics have been introduced for social 
network analysis, including degree centrality, betweenness 
centrality [21], eigenvector centrality [22], PageRank [3], and
Alpha-Centrality [23].

1) PageRank: A PageRank vector $pr_\alpha(s, t)$ is the steady
state probability distribution of a random walk with damping
factor $\alpha$ (restart probability $= 1 - \alpha$). The
starting vector $s$ gives the probability distribution for
where the walk transitions after restarting. The transfer
matrix encodes the transition probabilities of a random walk
on the network, $W = D^{-1}A$. PageRank is the unique steady
state solution $pr_\alpha(s, \infty)$ of:

$pr_\alpha(s, t) = (1 - \alpha)s + \alpha\, pr_\alpha(s, t-1) W$.    (11)

For ease of notation, we denote PageRank by $pr_\alpha(s)$.
Hence

$pr_\alpha(s) = (1 - \alpha)s + \alpha\, pr_\alpha(s) W$.    (12)

Equation (12) is identical to the steady state solution of the
linear conservative diffusion process given by Eq. (4), where
$W = W_c = D^{-1}A$ and $s = X_c(0)$. Therefore, PageRank
is the steady state solution of conservative diffusion, and
PageRank is a conservative metric. Most of the other metrics
derived from the random walk make an implicit assumption
of conservative diffusion taking place on a network.



2) Alpha-Centrality: Alpha-Centrality [23] measures the
total number of paths from a node, exponentially attenuated
by their length. For a starting vector $s$ and attenuation
parameter $\alpha$, the Alpha-Centrality vector is the steady
state solution to:

$cr_\alpha(s, t) = s + \alpha\, cr_\alpha(s, t-1) A$.    (13)

The starting vector $s$ is usually taken as in-degree
centrality [21]. For ease of notation, we shall denote
$cr_\alpha(s, t \to \infty)$ by $cr_\alpha(s)$. As
$t \to \infty$, the solution converges to

$cr_\alpha(s) = s + \alpha\, cr_\alpha(s) A$,    (14)

which holds while $|\alpha| < \frac{1}{|\lambda_1|}$.
One difficulty in applying Alpha-Centrality in network
analysis is that its key parameter $\alpha$ is bounded by
$1/|\lambda_1|$, the inverse of the spectral radius of the
network: the metric diverges at this value of the parameter.
To overcome this, normalized Alpha-Centrality [24] has been
recently introduced, which we denote by $ncr_\alpha(s, t)$. It
normalizes the score of each node by the sum of the
Alpha-Centrality scores of all the nodes. The new metric
avoids the problem of bounded parameters while retaining the
desirable characteristics of Alpha-Centrality, namely its
ability to differentiate between local and global structures.

Normalized Alpha-Centrality $ncr_\alpha(s, t \to \infty)$ is
defined using the equation shown below:

$ncr_\alpha(s, t) = \frac{1}{\|cr_\alpha(s, t)\|_1}\, cr_\alpha(s, t)$.    (15)

The new metric is well defined for $\alpha \ge 0$
($\alpha \ne \frac{1}{|\lambda_1|}$).

Equation (13) and Eq. (15) are mathematically equivalent to
Eq. (8) with starting vector $X_n(0) = c \cdot s$, where
$c = 1$ for Alpha-Centrality and
$c = \frac{1}{\|cr_\alpha(s, t)\|_1}$ for normalized
Alpha-Centrality. Therefore, Alpha-Centrality is a steady
state solution of linear non-conservative diffusion and is a
non-conservative metric. Other non-conservative metrics
include degree centrality, Katz score [25], SenderRank [26],
and eigenvector centrality [16].
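Equations (13) and (15) can be sketched as a plain fixed-point iteration followed by an $L_1$ normalization. This is an illustrative sketch only, not the scalable approximate algorithm mentioned in the introduction; the graph and function names are assumptions, and the starting vector is taken as in-degree as described above.

```python
def alpha_centrality(A, s, alpha, steps):
    """Iterate cr(t) = s + alpha * cr(t-1) A  (Eq. 13), with cr(0) = s."""
    n = len(s)
    cr = list(s)
    for _ in range(steps):
        cr = [s[j] + alpha * sum(cr[i] * A[i][j] for i in range(n))
              for j in range(n)]
    return cr


def normalized_alpha_centrality(A, s, alpha, steps):
    """Divide the Alpha-Centrality vector by its L1 norm (Eq. 15)."""
    cr = alpha_centrality(A, s, alpha, steps)
    total = sum(abs(c) for c in cr)
    return [c / total for c in cr]


# Hypothetical graph; column sums of A give the in-degree starting vector.
A = [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]
s = [sum(A[i][j] for i in range(3)) for j in range(3)]  # [1, 2, 1]

ncr_local = normalized_alpha_centrality(A, s, alpha=0.0, steps=50)
ncr_global = normalized_alpha_centrality(A, s, alpha=0.9, steps=50)
```

At $\alpha = 0$ the scores reduce to normalized in-degree. At $\alpha = 0.9$, which is above $1/\lambda_1 \approx 0.75$ for this graph, the unnormalized iterate grows with the number of steps, but the normalized scores at any finite step still sum to one.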



B. Length Scales and Epidemic Threshold 

The link between Alpha-Centrality (and normalized
Alpha-Centrality) and non-conservative diffusion leads to a
fundamental insight into the relationship between network
structure and the size of epidemics. Let us look more
carefully at Eq. (8). The weight distribution given by this
equation depends on the initial weight distribution $X_n(0)$
and the power series of matrices
$S(\alpha, t) = \sum_{k=0}^{t} (\alpha W_n)^k$.
For illustrative purposes, we can interpret $W_n$ to be the
adjacency matrix of some graph $G'$. Then each element
$S(\alpha, t)[i, j]$ of the power series can be interpreted as
the number of attenuated paths from node $i$ to node $j$ up to
length $t$ in that graph $G'$. In Alpha-Centrality or
normalized Alpha-Centrality, these paths determine the
centrality of the node along with the initial distribution of
weights. The probability of non-conservative diffusion
reaching node $j$ from $i$ through a path of length $k$ is
$\alpha^k$. $S(\alpha, t)[i, j]$ then characterizes the
expected number of times a non-conservative diffusion process
initiated at node $i$ reaches node $j$ up until time $t$. For
example, let node $i$ be infected by a virus and initiate a
viral infection in the network. If viral infection can be
modeled as linear non-conservative diffusion (Section II-B),
the probability that node $j$ will get infected by the viral
infection from node $i$ through a path of length $k$ would be
$\alpha^k$. Then $S(\alpha, t)[i, j]$ would quantify the
expected number of viruses reaching node $j$ when the viral
infection is initiated at node $i$.
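The path-counting interpretation can be checked on a tiny example. In the hypothetical 3-node graph below there is one path of length 1 and one path of length 2 from node 0 to node 2, so the corresponding entry of the power series should equal $\alpha + \alpha^2$. The function names are illustrative assumptions.

```python
def mat_mul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def attenuated_paths(A, alpha, t):
    """Return S(alpha, t) = sum_{k=0}^{t} (alpha * A)^k."""
    n = len(A)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    alpha_A = [[alpha * a for a in row] for row in A]
    term = identity                    # (alpha A)^0
    S = [row[:] for row in identity]
    for _ in range(t):
        term = mat_mul(term, alpha_A)  # next power of alpha A
        S = [[S[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return S


# Hypothetical graph G': edges 0 -> 1, 1 -> 2, and 0 -> 2.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
S = attenuated_paths(A, alpha=0.5, t=2)
```

Here `S[0][2]` combines the direct link (weight $\alpha$) with the two-step path through node 1 (weight $\alpha^2$).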

As shown in the Appendix, the expected path length of
diffusion as $t \to \infty$ is $\frac{1}{1 - \alpha\lambda_1}$
if $\alpha < \frac{1}{|\lambda_1|}$ and $O(t)$ if
$\alpha > \frac{1}{|\lambda_1|}$. Therefore,
$\frac{1}{|\lambda_1|}$ is a threshold: for $\alpha$ below the
threshold, the expected path length converges with time, while
for $\alpha$ above the threshold, it diverges. Note that this
threshold is equivalent to the epidemic threshold (Section
II-B). Thus, from the diffusion point of view, given the
network structure and nature of diffusion, $\alpha$ (for
$\alpha < 1/|\lambda_1|$) determines how far, on average, a
node's effect will be felt and sets the length scale of the
interaction. When $\alpha$ is small, Alpha-Centrality or
normalized Alpha-Centrality probes only the local structure of
the network. As $\alpha$ grows, structurally longer paths
become more important, (normalized) Alpha-Centrality becomes a
global measure, and the weight diffuses to a greater number of
nodes.

C. Choosing the Centrality Metric 

When applied to the same network, different centrality 
metrics may lead to different, often incompatible, views of 
who the important nodes are. We illustrate these differences 
on a toy network shown in Fig. 1, where a link from node
u to node v indicates that node v is an out-neighbor of u, 
e.g., u is a follower of v in an online social network. 







Figure 1. An example network, where node 1 has the highest Alpha- 
Centrality followed by node 3. In contrast node 3 has the highest PageRank 
followed by node 1. 



Even in this simple example, PageRank and Alpha-Centrality
disagree about who the most important node is.
PageRank without restarts ranks node 3 highest, followed
by node 1. In contrast, Alpha-Centrality ranks node 1 above
node 3. The difference in rankings produced by the two
centrality metrics is due to the difference in the underlying
diffusion process that redistributes the weights of the nodes.
Assume that all nodes start with equal weights, which then
evolve according to the rules of diffusion. In PageRank
without restarts (damping factor $\alpha = 1$), each follower
divides its weight equally among its $d_{out}$ out-neighbors,
and hence transfers a fraction $1/d_{out}$ to each. Thus, node
5 contributes 1/3 of its weight to node 1, and so does node
8. Node 3, on the other hand, gets the entire weight of
node 4, giving it a higher weight than node 1 and therefore
a higher rank.

In contrast to PageRank, Alpha-Centrality has nodes update
their weights by copying a portion of their followers'
weights. For consistency with PageRank, we take $\alpha = 1$.
Thus, node 1 will receive the entire weights of nodes 2, 5
and 8, while node 3 will only receive the weights from nodes
2 and 4. Therefore, the weight of node 1 will be greater than
that of node 3, and consequently, it will be ranked higher by
Alpha-Centrality.
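The one-step reasoning above can be reproduced with a small sketch. The links into nodes 1 and 3 follow the description in the text; the full edge set of Fig. 1 is not recoverable here, so the remaining links (from nodes 5 and 8 to nodes 6 and 7) and the out-degree of node 2 are assumptions made only so that the out-degrees match the fractions quoted above.

```python
from fractions import Fraction

# Follower -> followee edges. Links into nodes 1 and 3 follow the text;
# links to nodes 6 and 7 and node 2's out-degree of 2 are assumptions.
edges = [(2, 1), (2, 3), (4, 3),
         (5, 1), (5, 6), (5, 7),
         (8, 1), (8, 6), (8, 7)]
nodes = range(1, 9)
out_deg = {u: sum(1 for a, _ in edges if a == u) for u in nodes}
weight = {u: Fraction(1) for u in nodes}  # all nodes start with equal weight

conservative = {v: Fraction(0) for v in nodes}
non_conservative = {v: Fraction(0) for v in nodes}
for u, v in edges:
    conservative[v] += weight[u] / out_deg[u]  # follower splits its weight
    non_conservative[v] += weight[u]           # follower copies its weight
```

After one step, node 3 receives more than node 1 under the conservative rule (3/2 versus 7/6), while node 1 receives more than node 3 under the non-conservative rule (3 versus 2).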

Which ranking is right? How do we choose the right 
centrality metric for our problem? We claim that the choice 
of the centrality metric has to be motivated by details of the 
diffusion process taking place on the network. To analyze 
networks on which processes such as random walk, web 
surfing, money and used goods exchange are taking place, 
conservative metrics, such as PageRank, are appropriate. On 
the other hand, to study social networks on which informa- 
tion or epidemics are spreading, non-conservative metrics, 
such as Alpha-Centrality, should be used. In other words, 
the centrality metric that best predicts important nodes in 
a network is one whose implicit dynamics most closely 
matches the diffusion process occurring on the network. 

IV. Predicting Influentials in Online Social 
Networks 

Online social networks on sites such as Facebook, Twitter, 
and Digg have become important hubs of social activity 
and conduits of information. The ever-growing popularity 
of these networks and the overwhelming amount of information
contained in them call for a more principled approach to
social network analysis and data mining.
Correctly identifying influential nodes on these networks can
have far-reaching consequences for identifying noteworthy
content [27], targeted information dissemination [12], and
other applications. While a variety of methods [28], [29]
have been used to identify influential users in online social
networks, each metric leads to a different result, and no
justification for these metrics has been proposed.

Fortunately, by exposing the activity of their users, online
social networks provide a unique opportunity to study dynamic
processes on networks. We analyze information flow
on the social news aggregator Digg and use this data to
empirically evaluate centrality metrics. By posting a story
on Digg, the submitter broadcasts it to her followers. When
another user votes for this story, she broadcasts it to her own 
followers. We claim that since broadcast-driven information 
diffusion on Digg is non-conservative in nature, a non- 
conservative metric will better identify influential users than 
a conservative metric. 

The Digg dataset comprises around 300K users and over 
1 million friendship links, from which we can extract the 
directed follower network of active users. These users were 
active in spreading stories on Digg by either submitting them 
or voting for them, since both activities expose the story to 
the submitter or voter's followers. The data set contains more 
than 3 million votes on more than 3000 stories promoted to 
Digg's front page in June 2009. Note that the underlying 
follower graph was extracted independently of user activity.
In fact, user activity provides an independent measure of 
influence in online social networks that we use to evaluate 
the centrality metrics. 

A. Empirical Estimates of Influence 

Katz and Lazarsfeld [30] defined influentials as "individuals
who were likely to influence other persons in their
immediate environment." In the years that followed, many
attempts were made to identify people who influenced others
to adopt a new practice or product by looking at how
innovations or word-of-mouth recommendations spread [31].
The rise of online social networks has allowed researchers
to trace the flow of information through social links on a
massive scale. Using the new empirical foundation, some
researchers proposed to measure a person's influence by the
size of the cascade he or she triggers [12]. However, as
Watts and Dodds [32] note, "the ability of any individual
to trigger a cascade depends much more on the global
structure of the influence network than on his or her personal
degree of influence." Alternatively, Trusov et al. [33] defined
influential people in an online social network as those whose
activity stimulates those connected to them to increase their
activity, while Cha et al. [28] used the number of retweets
and mentions to measure user influence on Twitter.

Motivated by these works, we measure influence by
analyzing users' activity on an online social network.
Suppose some user, the submitter, posts a new story on Digg.
We measure the activity the submitter's post generates by the
number of times it is re-broadcast by followers. Whether or
not a user will re-broadcast the story depends on (i) story
quality and (ii) influence of the submitter. We assume that
a story's quality is uncorrelated with the submitter.^2
Therefore, we can average out its contribution to the activity
a submitter generates by aggregating over all stories
submitted by the same user. We claim that the residual
difference between submitters can be attributed to variations
in influence. We propose two metrics to measure a submitter's
influence: (i) the average number of follower votes her posts
generate and (ii) the average size of the cascades her posts
trigger.

^2 This is a fairly strong assumption, but it appears to hold
at least for Digg [27].
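As a rough sketch of how the two empirical measures could be computed, the example below averages per-story counts over each submitter and ranks submitters by each average. The records, the field layout, and the function name are hypothetical; the actual Digg data format and the way cascade size is extracted are not specified in this section.

```python
from collections import defaultdict

# Hypothetical per-story records: (submitter, follower_votes, cascade_size).
stories = [
    ("alice", 40, 120),
    ("alice", 60, 180),
    ("bob", 10, 300),
    ("bob", 30, 100),
    ("carol", 25, 50),
]


def average_by_submitter(records, field):
    """Average the given field (tuple index) over each submitter's stories."""
    totals, counts = defaultdict(float), defaultdict(int)
    for rec in records:
        totals[rec[0]] += rec[field]
        counts[rec[0]] += 1
    return {user: totals[user] / counts[user] for user in totals}


avg_votes = average_by_submitter(stories, 1)    # measure (i)
avg_cascade = average_by_submitter(stories, 2)  # measure (ii)

rank_by_votes = sorted(avg_votes, key=avg_votes.get, reverse=True)
rank_by_cascade = sorted(avg_cascade, key=avg_cascade.get, reverse=True)
```

In this toy data the two measures produce different orderings, which is why each is treated as a separate ground truth.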

B. Comparison of Centrality Metrics

Figure 2. Correlation between the rankings produced by the empirical
measures of influence and those predicted by normalized Alpha-Centrality
and PageRank. We use (a) the average number of follower votes and (b)
average cascade size as the empirical measures of influence. The inset
zooms into the variation in correlation for $0 \le \alpha \le 0.01$.



We use the empirical estimates of influence to rank a 
subset of users in our sample who submitted more than one 
story which received at least 100 votes. There were 289 
Digg users in this sample. We used the rankings produced 
by either empirical estimate as the ground truth to evaluate 
the performance of different centrality metrics. We studied 
PageRank (with uniform starting vector) and normalized 
Alpha-Centrality, both of which were computed considering 
the entire Digg follower network as a graph, with the thou- 
sands of users as nodes and the millions of friendship links 
as edges. We use Pearson's correlation coefficient (since
ties in rank may exist) to compare the rankings predicted
by the different centrality metrics with the ground truth.
Figure 2 shows how the correlation in rankings changes
with the parameter $0 \leq \alpha \leq 1$. This parameter stands
for the attenuation factor for normalized Alpha-Centrality
(see Equation 13) and the damping factor (restart probability
$= 1 - \alpha$) for PageRank (see Equation 1). If we used
Alpha-Centrality instead of normalized Alpha-Centrality, we
would have been bounded by its formalization to compute
the rankings only for $\alpha < 1/|\lambda_1|$. Note that the correlation of
PageRank at $\alpha = 0$ (restart probability $= 1$) with the empirical
estimate cannot be computed because the standard deviation
of PageRank rankings would be zero in this case. Various
studies have tested different damping factors for PageRank,
but it is generally assumed that the damping factor should be
set around $\alpha = 0.85$ [3]. Boldi et al. [34] claim that in the case
of PageRank, "for real-world graphs values of $\alpha$ close to 1
do not give a more meaningful ranking." Except for values of
$\alpha$ close to 1, the influence rankings calculated from
Alpha-Centrality correlated better with the empirical estimates of
influence rankings than PageRank rankings. Therefore, we
conclude that Alpha-Centrality predicts central users in the
Digg social network better than PageRank.
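
The comparison step can be illustrated with a short sketch. It assumes that tied scores share their average rank and that Pearson's coefficient is then computed on the two rank vectors; the function names and the four sample scores are hypothetical and are not taken from the Digg data.

```python
import math


def average_ranks(scores):
    """Rank scores in descending order; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for a constant ranking")
    return cov / (sx * sy)


# Made-up scores for four users: an empirical measure and a centrality metric.
empirical = [40, 10, 25, 25]
centrality = [0.9, 0.1, 0.4, 0.5]
rho = pearson(average_ranks(empirical), average_ranks(centrality))
```

The `ValueError` branch corresponds to the degenerate case noted above, where every user receives the same PageRank score and the standard deviation of the ranking is zero.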

V. Approximation Algorithm for 
Alpha-Centrality 

In order to compute the exact Alpha-Centrality vector we
have to solve Equation 13, which requires us to compute
a matrix inverse. Computing a matrix inverse in a naive
implementation takes $O(n^3)$ time (where $n$ is the number of
nodes in the network), so this is difficult to compute for large
networks. One way to compute an approximate solution is
to use the alternate formulation given in Equation 7 and
compute $s(I + \alpha A + \alpha^2 A^2 + \alpha^3 A^3 + \cdots)$ until the $\alpha^k$
coefficient grows sufficiently small. While this technique is
effective in practice, computing $A^k$ in each iteration using
a naive implementation takes at least $n^2$ time, and it is
not clear how many iterations we need to
get a good approximation. In this section we present an
algorithm for approximating Alpha-Centrality, which has
a single parameter that controls both the runtime and the
quality of the produced approximation.
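
For reference, the truncated series can be accumulated with repeated vector-matrix products, which avoids forming the matrix powers explicitly. The following is a minimal sketch under that substitution and is not the implementation used in the experiments; the adjacency-dictionary layout `out_edges`, the function name `series_centrality`, and the stopping tolerance `tol` are assumptions made for illustration.

```python
def series_centrality(out_edges, s, alpha, tol=1e-12, max_terms=10_000):
    """Truncate cr = s(I + alpha*A + alpha^2*A^2 + ...) once a term is negligible.

    out_edges[u] maps each out-neighbor v of u to the weight A[u][v];
    s maps every vertex to its starting value (row-vector convention).
    """
    term = dict(s)   # current term s * (alpha*A)^k, starting with k = 0
    cr = dict(s)     # running sum of the terms
    for _ in range(max_terms):
        nxt = {u: 0.0 for u in s}
        for u, mass in term.items():
            for v, w in out_edges.get(u, {}).items():
                nxt[v] += alpha * mass * w   # one step of term * (alpha*A)
        term = nxt
        for u, mass in term.items():
            cr[u] += mass
        if sum(abs(m) for m in term.values()) < tol:
            break
    return cr


# Directed 3-cycle with unit weights; by symmetry cr(u) = 1 / (1 - alpha) = 2.
cycle = {0: {1: 1.0}, 1: {2: 1.0}, 2: {0: 1.0}}
series_result = series_centrality(cycle, {0: 1.0, 1: 1.0, 2: 1.0}, alpha=0.5)
```

Each pass costs time proportional to the number of edges, but the number of passes needed still depends on how quickly the terms decay, which is the difficulty the approximation algorithm addresses.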

A description of our algorithm is given in Algorithm 1.
Our procedure is similar to the algorithm for approximating
PageRank that is given in [2]. Our algorithm takes the
network, the starting vector $s$, $\alpha$, and an approximation
parameter $\delta$ ($0 < \delta < 1$) as input, and computes an
approximate Alpha-Centrality vector where each entry has
relative error of at most $\delta$ (see Theorem 2). In order to approximate
a centrality vector with starting vector $s$, we maintain an
approximate centrality vector $cr$ and a residual vector $r$.
Initially $r$ is equal to the starting vector $s$; the algorithm
iteratively moves content from $r$ to $cr$ until each entry in $r$
is small.

When the $\alpha$ parameter is fixed, we use $cr(s)$ to denote
$cr_\alpha(s)$. We will also use $[cr(s)](u)$ to refer to how much



Algorithm 1 Approximate-Centrality($V$, $E$, $s$, $\alpha$, $\delta$)
 1: $\epsilon = \delta \|s\|_1 / n$;
 2: $r = s$;
 3: Queue q = new Queue();
 4: for each $u \in V$ do
 5:     $cr(u) = 0$;
 6:     if $r(u) > \epsilon$ then
 7:         q.add(u);
 8:     end if
 9: end for
10: while q.size() > 0 do
11:     u = q.dequeue();
12:     $cr(u) = cr(u) + r(u)$;
13:     $T = \alpha \cdot r(u)$;
14:     $r(u) = 0$;
15:     for each $v \in N(u)$ do
16:         $r(v) = r(v) + T \cdot w(u, v)$;
17:         if !q.contains(v) and $r(v) > \epsilon$ then
18:             q.add(v);
19:         end if
20:     end for
21: end while
22: return $cr$;
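
A direct transcription of Algorithm 1 into Python might look as follows. It is a sketch rather than the authors' code: the dictionary-based graph layout, the auxiliary set `queued` (a constant-time stand-in for `q.contains`), and the small example graph are assumptions of this illustration.

```python
from collections import deque


def approximate_centrality(out_edges, s, alpha, delta):
    """Queue-based push procedure following Algorithm 1.

    out_edges[u] maps each out-neighbor v of u to the edge weight w(u, v);
    s maps every vertex to its starting value. Both layouts are assumptions
    of this sketch.
    """
    eps = delta * sum(s.values()) / len(s)      # line 1
    r = dict(s)                                 # line 2
    cr = {u: 0.0 for u in s}                    # line 5
    queue = deque(u for u in s if r[u] > eps)   # lines 4-9
    queued = set(queue)                         # mirrors the queue contents
    while queue:                                # line 10
        u = queue.popleft()                     # line 11
        queued.discard(u)
        cr[u] += r[u]                           # line 12
        push = alpha * r[u]                     # line 13
        r[u] = 0.0                              # line 14
        for v, w in out_edges.get(u, {}).items():
            r[v] += push * w                    # line 16
            if v not in queued and r[v] > eps:  # line 17
                queue.append(v)
                queued.add(v)
    return cr


# Directed 3-cycle with unit weights: d_max = 1, so alpha = 0.5 satisfies
# alpha < c / d_max for c = 0.6.
cycle = {0: {1: 1.0}, 1: {2: 1.0}, 2: {0: 1.0}}
start = {0: 1.0, 1: 1.0, 2: 1.0}
approx = approximate_centrality(cycle, start, alpha=0.5, delta=0.01)
```

On this cycle the exact Alpha-Centrality of every vertex is 2, so with $\delta = 0.01$ every returned entry should lie between 1.98 and 2.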



content vertex $u$ has in $cr(s)$. We give our formal
performance guarantee for Algorithm 1 in Theorem 2. This
performance guarantee is based on Lemma 1, which shows
that in any step of the algorithm, the approximate centrality
computed for Alpha-Centrality with $s$ as starting vector is
always exactly equivalent to Alpha-Centrality with $s - r$ as
starting vector, where $r$ is the residual vector in that step; i.e.,
throughout the execution of the algorithm, the error in the
approximate centrality vector depends on the amount of
content remaining in the residual vector.

Our arguments depend on the linearity of the centrality
computation with respect to the starting vector, which is easy
to verify. We can show that $cr_\alpha(s_1) + cr_\alpha(s_2) = cr_\alpha(s_1 + s_2)$,
and $c \cdot cr_\alpha(s) = cr_\alpha(c \cdot s)$.

Lemma 1: The invariant $cr = cr(s - r)$ is maintained
throughout the execution of the while-loop.

Proof: Before the loop starts, we have $r = s$ and $cr = 0$,
so $cr(s - r) = cr(0) = 0 = cr$. We can also show that if
$cr = cr(s - r)$ holds prior to an iteration of the loop, then
$cr' = cr(s - r')$ is still true after the iteration, where $cr'$
and $r'$ are the updated approximate centrality and residual
vectors.

We first observe that $cr(s)A = cr(sA)$. To see this,
consider that by definition $cr(s) = s + \alpha \cdot cr(s)A$. Multiplying
this equation by $A$ we get $cr(s)A = sA + \alpha \cdot (cr(s)A)A$.
This shows that $cr(s)A$ is by definition a centrality vector for
starting vector $sA$. Moreover, we know that the solution to
$cr(sA)$ is unique, so we have $cr(s)A = cr(sA)$. This observation
shows that we can iteratively compute the centrality
vector by expressing $cr(s)A$ as $cr(sA)$.

We will write the operations performed inside the while-loop
using vector-matrix notation. We use $e_u$ to denote a
row vector that has all of its content in vertex $u$: $e_u(i) = 1$
if $i = u$; otherwise, $e_u(i) = 0$.

After an iteration of the loop we have $cr' = cr + r(u)e_u$,
and $r' = r - r(u)e_u + \alpha r(u)e_u A$, where $u$ is the vertex
that is dequeued in line 11. We next specify the relationship
between the approximate centrality and residual vectors
before and after an iteration of the while-loop. Consider that

$$\begin{aligned}
cr(r) &= cr(r - r(u)e_u) + cr(r(u)e_u) \\
&= cr(r - r(u)e_u) + r(u)e_u + cr(\alpha r(u)e_u A) \\
&= cr(r - r(u)e_u + \alpha r(u)e_u A) + r(u)e_u \\
&= cr(r') + cr' - cr.
\end{aligned}$$

If $cr = cr(s - r)$, we have $cr(r) = cr(r') + cr' - cr(s - r)$.
It follows that $cr' = cr(r) - cr(r') + cr(s - r) = cr(r - r' + (s - r)) = cr(s - r')$. ■
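
The invariant can also be checked numerically. The sketch below applies a few iterations of the while-loop by hand and confirms that the approximate vector plus the centrality of the residual always equals the centrality of the starting vector, which is the invariant restated using linearity. The reference routine (a long truncated series), the four-edge example graph, and all names are assumptions of this illustration.

```python
def exact_centrality(out_edges, s, alpha, terms=200):
    """Reference value of cr(s) = s(I + alpha*A + ...), truncated after `terms` terms."""
    term, total = dict(s), dict(s)
    for _ in range(terms):
        nxt = {u: 0.0 for u in s}
        for u, mass in term.items():
            for v, w in out_edges.get(u, {}).items():
                nxt[v] += alpha * mass * w
        term = nxt
        for u in total:
            total[u] += term[u]
    return total


def push(out_edges, cr, r, u, alpha):
    """One iteration of the while-loop for vertex u (lines 12-16 of Algorithm 1)."""
    mass = r[u]
    cr[u] += mass
    r[u] = 0.0
    for v, w in out_edges.get(u, {}).items():
        r[v] += alpha * mass * w


# Small directed graph chosen for the example; alpha is well below 1 / lambda_1.
out_edges = {0: {1: 1.0, 2: 1.0}, 1: {2: 1.0}, 2: {0: 1.0}}
s = {0: 1.0, 1: 1.0, 2: 1.0}
alpha = 0.3
target = exact_centrality(out_edges, s, alpha)

cr = {u: 0.0 for u in s}
r = dict(s)
max_gap = 0.0
for u in [0, 1, 2, 0, 2, 1]:
    push(out_edges, cr, r, u, alpha)
    rest = exact_centrality(out_edges, r, alpha)
    # cr = cr(s - r) is equivalent to cr + cr(r) = cr(s) by linearity.
    gap = max(abs(cr[v] + rest[v] - target[v]) for v in s)
    max_gap = max(max_gap, gap)
```

Up to floating-point rounding, `max_gap` stays at zero regardless of the order in which vertices are pushed.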

Theorem 2: Given an $\alpha < c/d_{max}$ for some $c < 1$ and a
uniform starting vector $s$, the vector $cr$ output by
Approximate-Centrality satisfies $[cr(s)](u) \geq cr(u) \geq [cr(s)](u)(1 - \delta)$
for each vertex $u \in V$. The runtime of the algorithm is
$O\left(\frac{n}{\delta} d_{max}\right)$.

Proof: Lemma 1 argues that $cr = cr(s - r) = cr(s) - cr(r)$
throughout the execution of the algorithm, so we have
$cr(u) = [cr(s)](u) - [cr(r)](u)$ for all vertices $u \in V$. Given
a uniform starting vector $s$, $s(u) = \|s\|_1/n$ for all $u \in V$.
The algorithm terminates when $r(u) \leq \epsilon$ for all $u \in V$,
so we choose $\epsilon = \delta \cdot \|s\|_1/n = \delta s(u)$ such that upon
completion $r(u) \leq \delta s(u)$ for all $u \in V$.

Clearly, $[cr(s)](u) \geq cr(u)$ because $r$ and $cr(r)$ are
non-negative. We can also show that given that $r(u) \leq \delta s(u)$ for
all $u \in V$, $[cr(r)](u) \leq \delta [cr(s)](u)$ for all vertices $u \in V$. It
follows that $cr(u) = [cr(s)](u) - [cr(r)](u) \geq [cr(s)](u)(1 - \delta)$.
Therefore we can see that indeed $[cr(s)](u) \geq cr(u) \geq [cr(s)](u)(1 - \delta)$
for all vertices $u \in V$.

We assume that $\alpha$ is chosen such that $\alpha < c/d_{max}$ for
some constant $c < 1$, where $d_{max}$ is the largest out-degree
of any node in the graph. In order to bound the
runtime of the algorithm, consider that each iteration of
the while-loop decreases the sum of the entries of $r$ by
$(1 - \alpha \cdot d_{out}(u)) r(u) > (1 - \alpha \cdot d_{out}(u)) \epsilon \geq (1 - \alpha \cdot d_{max}) \epsilon > (1 - c) \epsilon$.
Because $r = s$ at initialization and each iteration
decreases $\|r\|_1$ by at least $(1 - c)\epsilon$, the number of iterations
$i$ must satisfy $i(1 - c)\epsilon \leq \|s\|_1$. Therefore the number of
iterations may be at most $\frac{\|s\|_1}{(1 - c)\epsilon} = O(\|s\|_1/\epsilon)$. The cost of
each iteration is proportional to the out-degree of the node
that is dequeued, so the worst-case runtime of the algorithm
is $O(\|s\|_1/\epsilon \cdot d_{max})$. For our choice of $\epsilon$ this is equivalent
to $O\left(\frac{n}{\delta} d_{max}\right)$. ■
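
As a worked instance of the iteration bound, consider illustrative values that are assumptions of this example rather than figures from the experiments: a uniform starting vector with $s(u) = 1$ on $n = 1000$ nodes, $\delta = 0.01$, and $c = 0.5$. Then

$$\epsilon = \frac{\delta \|s\|_1}{n} = \frac{0.01 \cdot 1000}{1000} = 0.01, \qquad i \leq \frac{\|s\|_1}{(1 - c)\epsilon} = \frac{1000}{0.5 \cdot 0.01} = 200{,}000.$$

Each iteration touches at most $d_{max}$ out-edges, so the total work in this instance is at most $200{,}000 \cdot d_{max}$ edge updates, which agrees with the stated bound since $n/\delta = 100{,}000$ and the factor $1/(1 - c) = 2$ is a constant.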

A. Quality of Approximate Results 

We compare the performance of the approximate algorithm
with the power iteration method in Equation 13
using the indegree as the starting vector, like in [16] and



p2| . To compute Alpha-centrality using the approximate 
algorithm, we fix e (Algorithm to be 3.57 x 10^^ and 
1.42 X 10~* guaranteeing that the error in approximation 
would be less than 1% ($\delta < 0.01$). We terminate the power
iteration algorithm after 100 iterations in Digg and 10 to
100 iterations in Twitter. We calculate the RMS (root mean
square) error of the approximate algorithm with respect to
the power iteration algorithm, for different values of $\alpha$. The
RMS error averaged over all values of $\alpha$ is 0.797% and
0.75% for Digg and Twitter respectively.
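
One plausible way to compute such an error figure is sketched below. The text does not spell out the normalization, so the use of per-node relative error is an assumption of this example, as are the function name and the two-node sample values.

```python
import math


def rms_relative_error(approx, reference):
    """Root mean square of per-node relative errors (assumed normalization)."""
    squared = [((approx[u] - reference[u]) / reference[u]) ** 2 for u in reference]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical centrality values for two nodes.
reference = {"a": 1.0, "b": 2.0}
approx_values = {"a": 0.99, "b": 2.0}
err = rms_relative_error(approx_values, reference)
```

Multiplying the returned value by 100 gives a percentage comparable in form to the figures reported above.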

VI. Related Work 

The interplay of the structural properties of the underlying
network with the diffusion processes occurring in
it contributes to the complexity of real-life networks. For
example, in epidemiology, the dynamics of disease spread
on a network and the epidemic threshold are closely related
to the spectral radius of the graph [17]. Similarly, a random
walk on a graph is closely related to the Laplacian of the graph
[35].

The range of diffusion processes that can occur on a
network includes the spread of epidemics [9], [10] and
information | [T4| , viral marketing | [T2j , fl?) , word-of-mouth 
recommendation [11], money exchange, e-mail forwarding [36],
and Web surfing [3], among others. Researchers
have developed an arsenal of centrality metrics to study
the properties of networks, including degree, closeness [37],
graph [38] and betweenness [21]; Markov process-based
random measures like Hubbell's model [39]; and path-based
ranking measures like the Katz score [25], SenderRank [26], and
eigenvector centrality [22]. However, as Borgatti noted [1],
most centrality measures make implicit assumptions about
the diffusion process occurring on a network. In order to
give correct predictions, these assumptions must match the
actual dynamics of the network. Borgatti classified dynamic
processes according to the trajectories they follow (geodesic,
path, trail, walk) and the method of spread (transfer, serial
or parallel duplication). We, on the other hand, maintain that
a simpler classification scheme that divides dynamic processes
into conservative and non-conservative captures the
essential differences between them and informs the choice
of the centrality metric. Apart from PageRank and
Alpha-Centrality, other measures can also be classified as conservative
or non-conservative.

Online social networks provide us with a unique opportunity
to study the dynamic processes occurring on networks.
Some studies compared empirical measures, such as tweets
and mentions on Twitter [28], [40], with centrality metrics
including PageRank and in-degree centrality. We, on the
other hand, differentiate between two distinct methods
of quantifying influence: estimating influence by measuring
the dynamics of social network behavior, and using centrality
metrics to predict influence. In addition, we evaluate the
predictive influence models using the empirical measurements.


Similar to personalized PageRank Q for conservative 
diffusion, each user's unique notion of importance in
non-conservative diffusion can be captured using a customized
starting vector for individual users in Alpha-Centrality,
leading to personalized Alpha-Centrality. The use of residual
vectors and incremental computation in the calculation of
approximate Alpha-Centrality makes the method scalable.
Moreover, as in personalized PageRank, these residual
vectors can be shared across multiple personalized views,
scaling the personalized Alpha-Centrality metric. Analogous
to approximate PageRank [2], in approximate
Alpha-Centrality the residual vector is redistributed at each iteration
to reduce the difference between the Alpha-Centrality vector
and its approximate version. However, the process of
redistribution of the residual vector mimics the kind of
diffusion the model emulates. For approximate PageRank,
the redistribution of residual vectors is conservative (with the
total weight of the residual vector conserved). On the other
hand, in approximate Alpha-Centrality, the redistribution of
residual vectors is not conservative.

VII. Conclusion 

We described two fundamentally distinct diffusion processes,
which can be mathematically differentiated based on
whether or not they conserve the quantity that is diffusing on
the network. A random walk, which conserves the probability
density of the diffusing quantity, can be modeled as a
conservative diffusion process, while epidemics and information
spread can be modeled as non-conservative diffusion processes.
We showed that centrality metrics, such as PageRank
and Alpha-Centrality, can be classified as conservative or
non-conservative based on the implicit assumptions they
make about the redistribution of weight. We showed that
since Alpha-Centrality is mathematically equivalent to
non-conservative diffusion, it should be used to identify central
nodes in online social networks whose primary function is
to spread information, a non-conservative process. Future
work includes applying this analysis to other online social
networks like Twitter and exploring how diffusion processes
affect other aspects of social network analysis. Our work
provides just the initial study of non-conservative diffusion;
much work remains to be done to understand its properties
and extensions. For example, application to personalized
Alpha-Centrality may be productive. We hope that our work
motivates readers to study the properties of non-conservative
diffusion and investigate the use of non-conservative diffusion in
social network analysis.
Appendix

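Replication matrix $W_n$ can be written in terms of its
eigenvalues and eigenvectors as:

$$W_n = X \Lambda X^{-1} = \sum_{i=1}^{n} \lambda_i Y_i \qquad (16)$$

where $X$ is a matrix whose columns are the eigenvectors
of $W_n$. $\Lambda$ is a diagonal matrix, whose diagonal elements
are the eigenvalues, $\Lambda_{ii} = \lambda_i$, arranged according to the
ordering of the eigenvectors in $X$. Without loss of generality
we assume that $\lambda_1 > \lambda_2 > \cdots > \lambda_n$. The matrices $Y_i$ can
be determined from the product

$$Y_i = X Z_i X^{-1} \qquad (17)$$

where $Z_i$ is the selection matrix having zeros everywhere
except for element $(Z_i)_{ii} = 1$ [41]. Therefore

$$\sum_{k=0}^{t} \alpha^{k} W_n^{k} = \sum_{i=1}^{n} \frac{(-1)^{l_i}\left(1 - \alpha^{t+1}\lambda_i^{t+1}\right)}{(-1)^{l_i}\left(1 - \alpha\lambda_i\right)} Y_i \qquad (18)$$

where $l_i = 0$ if $\alpha|\lambda_i| < 1$ and $l_i = 1$ if $\alpha|\lambda_i| > 1$. As
is obvious from above, for Equation 18 to hold non-trivially,
$\alpha \neq 1/|\lambda_i|$ $\forall i = 1, 2, \ldots, n$. Now assuming $|\lambda_1|$ is strictly
greater than any other eigenvalue,

$$S(\alpha, t) \approx \frac{(-1)^{l_1}\left(\alpha\lambda_1\left(1 - \alpha^{t+1}\lambda_1^{t+1}\right)\right)}{(-1)^{l_1}\left(1 - \alpha\lambda_1\right)} Y_1$$

For any matrix $M$, let $\|M\|_1 = \sum_{i,j} M[i, j]$. Therefore,
the expected number of paths is $\|S(\alpha, t)\|_1$. The expected
path length is given by:

$$\frac{\sum_{k=0}^{t} (k+1)\,\alpha^{k+1} \|W_n^{k+1}\|_1}{\sum_{k=0}^{t} \alpha^{k+1} \|W_n^{k+1}\|_1} = \alpha \frac{d}{d\alpha} \ln \|S(\alpha, t)\|_1 \approx \frac{1}{1 - \alpha\lambda_1} - (t+1)\frac{\alpha^{t+1}\lambda_1^{t+1}}{1 - \alpha^{t+1}\lambda_1^{t+1}}$$

Therefore, as $t \to \infty$ and $\alpha|\lambda_1| < 1$, the expected path
length is approximately $\frac{1}{1 - \alpha\lambda_1}$, and for $\alpha|\lambda_1| > 1$ it is $O(t)$.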
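
The two regimes of the expected path length can be illustrated with a scalar check. The sketch assumes the dominant-eigenvalue approximation, so that a path of length $k+1$ carries weight $x^{k+1}$ with $x$ standing in for $\alpha\lambda_1$; the function names and test values are hypothetical.

```python
def mean_path_length(x, t):
    """Weighted mean of path lengths 1..t+1 with weights x**length."""
    weights = [x ** (k + 1) for k in range(t + 1)]
    return sum((k + 1) * w for k, w in enumerate(weights)) / sum(weights)


def closed_form(x, t):
    """Closed form 1/(1 - x) - (t + 1) * x**(t+1) / (1 - x**(t+1)), for x != 1."""
    return 1 / (1 - x) - (t + 1) * x ** (t + 1) / (1 - x ** (t + 1))


below = mean_path_length(0.5, 200)   # x < 1: settles near 1 / (1 - x) = 2
above = mean_path_length(2.0, 50)    # x > 1: grows linearly with t
```

Below the threshold the mean length settles at a constant independent of $t$, whereas above it the longest paths dominate and the mean tracks $t$.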



